
Conversation

@farzadab
Contributor

As reported in #2559, reading from HF Hub datasets can sometimes result in 500 Internal Server Errors. Even though these are rare, they can cause problems on large training runs.

The source of this error has not yet been identified, but a simple fix is to add the missing 500 error code to the retry list.
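For illustration, the idea of retrying on specific status codes with exponential backoff can be sketched as follows. This is not the actual `huggingface_hub` implementation; the helper name and parameters are hypothetical, chosen to mirror the behavior the PR describes (retrying on both 503 and, after this change, 500).

```python
import time

def call_with_backoff(request_fn, retry_on_status_codes=(500, 503),
                      max_retries=5, base_wait_time=1.0, max_wait_time=8.0):
    """Call request_fn() and return its HTTP status code, retrying on the
    given status codes with a doubling wait (capped at max_wait_time)."""
    wait = base_wait_time
    for attempt in range(max_retries + 1):
        status = request_fn()
        if status not in retry_on_status_codes or attempt == max_retries:
            return status
        time.sleep(wait)
        wait = min(wait * 2, max_wait_time)

# Example: a flaky server that returns 500 twice, then succeeds.
responses = iter([500, 500, 200])
status = call_with_backoff(lambda: next(responses), base_wait_time=0.01)
# status is 200: the transient 500s were retried away.
```

Without 500 in `retry_on_status_codes`, the first response would be returned immediately and the caller would fail, which is exactly the failure mode reported in #2559.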

@Wauplin
Contributor

Wauplin commented Sep 26, 2024

Hi @farzadab, thanks for the PR. http_backoff is called 3 times in hf_file_system.py to download files in different ways. Could you update all 3 of them? Thanks!

@farzadab
Contributor Author

My bad. Done.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Contributor

@Wauplin Wauplin left a comment


Thanks! Looking good now :)

For the record, we've discussed this change internally (private Slack). We had some concerns that retrying on HTTP 500 without knowing the root cause is clunky (it may or may not be a transient error), but since:

  • we retry only up to 5 times (so ~1min)
  • and only when using hf_file_system so basically when streaming a dataset from the Hub

then it's fine to do it. The cost of breaking a training process because of a transient HTTP 500 is higher than the cost of breaking a download process when instantiating a model once.
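The "~1 min" upper bound above follows from the doubling backoff: with illustrative parameters (a 2 s base wait, 5 retries, no cap; the real defaults may differ), the worst-case total wait is:

```python
# Hypothetical parameters: 2 s base wait, doubling each retry, 5 retries.
waits = [2 * 2**i for i in range(5)]  # [2, 4, 8, 16, 32] seconds
total = sum(waits)                    # 62 seconds, i.e. roughly a minute
```

So even if the 500 is not transient, a streaming read fails after about a minute rather than hanging indefinitely.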

All of this to say, it's good to merge! 🤗

@Wauplin Wauplin merged commit 476fa0b into huggingface:main Sep 27, 2024
16 checks passed
@farzadab farzadab deleted the patch-1 branch September 30, 2024 16:48

3 participants